ABSTRACT


Radar Electronic Support Measures (ESM) systems are faced with increasingly dense and
complex electromagnetic environments. Traditional algorithms for signal recognition and analysis
are highly complex, computationally intensive, often rely on heuristics, and require humans to
verify and validate the analysis. In this paper, the use of an alternative technique, artificial
neural networks, to classify pulse-to-pulse signal modulation patterns is investigated. Neural
networks are an attractive alternative because of their potential to solve difficult classification
problems more effectively and more quickly than conventional techniques. Neural networks
adapt to a problem by learning, even in the presence of noise or distortion in the input data,
without the requirement for human programming. In the paper, the fundamentals of network
construction, training, behaviour and methods to improve the training process and enhance a
network’s performance are discussed. The results and a description of the classification
experiments are also presented.

RÉSUMÉ

The electromagnetic environment in which a radar Electronic Support Measures (ESM)
system must operate is increasingly dense and complex. The classical algorithms for recognizing
and processing these signals are complex and heuristic; they generally require long computation
times as well as human intervention in the analysis and validation stages. Artificial neural
networks are an attractive alternative given their capacity to solve complex classification
problems more efficiently and more quickly than classical algorithms. Their ability to learn
allows them to adapt to the problem even in a noisy environment or with distorted inputs,
without human intervention. In this report, artificial neural networks are used to classify
signals according to the type of modulation of the pulse frequency. The first part of the report
presents the structure, the training algorithms and the behaviour of a network. The question
of accelerating the speed of training is also addressed. The second part describes the results of
the signal classifications performed by neural networks.
EXECUTIVE SUMMARY


The purpose of a radar Electronic Support Measures (ESM) system is to detect, determine
signal parameter values, and identify radar emitters in the electromagnetic environment. Research
initiated at the Defence Research Establishment Ottawa (DREO) has supported the development
of hardware and software for the next generation naval ESM system. This project is called
CANEWS 2.

CANEWS 2 incorporates a number of complex signal processing algorithms designed to
classify  radar  signals.  One  of  these  determines  the  pulse-to-pulse  modulation  pattern  of  an  agile 
signal.  The  objective  of  this  paper  is  to  investigate  the  technology  of  neural  networks  and  their 
applicability  to  this  signal  classification  problem. 

The potential offered by artificial neural networks has sparked a great deal of interest in
the research community. Although traditional algorithms have been very successful at tasks that
are characterized by formal logical rules, they have had little success at tasks which are difficult
to characterize this way. These are often tasks that the human brain performs well, such as
pattern recognition and common-sense reasoning. Among the attractive features of neural
networks are their ability to learn, even in the presence of noise or distortion in the input data,
the ease of maintenance, high performance, and the potential to solve difficult classification
problems more effectively than conventional techniques.

The basic component of an artificial network is the artificial neuron. A network is created
by connecting a number of neurons together. Each neuron contributes to the overall function
performed by the network by taking values from the nodes on its input side and calculating a
value which it passes on to the nodes on its output side. A special set of nodes is designated
as initial input nodes; each node receives a single value from a vector representing the
characteristics of an input to the network. These values are transformed as they propagate
through the network, the final destination being a set of nodes designated as the network’s
output. The output vector produced by these nodes represents a function of the initial input
vector.

This paper covers the fundamentals of neural network construction, training, and
behaviour, then deals with problems pertaining to a particular type of network called the
back-propagation network. Back-propagation networks are the most commonly used networks
for practical applications. A number of pragmatic considerations in the design of such networks
to improve their learning capacity and to enhance their performance are described. A set of
neural networks to classify signal modulation patterns was initially designed to take the
frequency values from a pulse train as input. By designing new networks that took input values
to which a Fourier transform had been applied, performance was greatly improved. The results
obtained from these experiments have led the authors to conclude that neural networks are a
promising technology in the area of radar signal classification.
1.0 INTRODUCTION


The purpose of a radar Electronic Support Measures (ESM) system is to detect, determine
signal parameter values, and identify radar emitters in the electromagnetic environment. Research
initiated at the Defence Research Establishment Ottawa (DREO) has supported the development
of hardware and software for the next generation naval ESM system. This project is called
CANEWS 2.

CANEWS 2 incorporates a number of signal processing algorithms designed to classify
radar  signals.  One  of  these  determines  the  pulse-to-pulse  modulation  pattern  of  an  agile  signal. 
The  goal  of  this  paper  is  to  investigate  the  technology  of  neural  networks  and  their  applicability 
to  this  signal  classification  problem. 

1.1 Signal Classification in CANEWS 2

The approach taken by the CANEWS 2 software to identify radar emitters is to first
deinterleave the detected signals into separate tracks. For an individual track, a classification is
performed on each of the signal’s parameters: pulse width, radio frequency of each pulse (RF),
pulse repetition interval (PRI), and scan pattern. The classification process reveals structural and
statistical attributes for each parameter. Identification then finds radars in an emitter library
whose known characteristics match the observed characteristics on a parameter by parameter
basis.

Of particular interest are the algorithms used to determine the structural and statistical
attributes of the parameters. For each parameter, CANEWS 2 maintains a hierarchy of all of its
possible structural characteristics. Entries at the same level in the taxonomy represent mutually
exclusive classifications; entries below a particular entry represent specializations of that entry.
For example, an RF signal is divided into two mutually exclusive classifications, either pulsed
or continuous wave. Within the pulsed classification are two sub-classifications: the pulses either
maintain a constant frequency or they vary from pulse to pulse. The variable classification in
turn has two sub-classifications: the variation is either periodic or random. For each of these
structural possibilities, the signal could either cover an entire range of RF values in a continuous
fashion, or merely cover a range at discrete points.
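The RF hierarchy described above can be pictured as a small tree. The following is only an illustrative sketch: the nested-dictionary encoding, the name `RF_TAXONOMY` and the helper `leaf_paths` are assumptions made for this example and are not taken from the CANEWS 2 software.

```python
# Hypothetical encoding of the RF structural taxonomy described in the text.
# Entries at the same level are mutually exclusive; nested entries are
# specializations of their parent entry.
RF_TAXONOMY = {
    "continuous wave": {},
    "pulsed": {
        "constant": {},
        "variable": {
            "periodic": {"continuous": {}, "discrete": {}},
            "random": {"continuous": {}, "discrete": {}},
        },
    },
}


def leaf_paths(tree, prefix=()):
    """Return every root-to-leaf classification path in the taxonomy."""
    if not tree:
        return [prefix]
    paths = []
    for name, subtree in tree.items():
        paths.extend(leaf_paths(subtree, prefix + (name,)))
    return paths


for path in leaf_paths(RF_TAXONOMY):
    print(", ".join(path))
```

Each printed path corresponds to one complete structural classification, for example "pulsed, variable, periodic, continuous".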
A set of statistical values is maintained for each structural classification. For example,
in the case of a pulsed, variable, periodic, continuous RF signal, the set includes: the limits and
the mean of the RF values observed, a description of the pulse-to-pulse modulation pattern (ramp,
sinusoidal, triangular, ...), and the length of the period of the pattern. In the case of a random
continuous RF signal, the set includes: the limits and mean as before, as well as a description
of the probability distribution of the RF values (Gaussian, uniform, triangular, ...). The
algorithms to compute these values, in particular the pulse-to-pulse modulation pattern and the
probability distribution, have a high order of complexity. To determine the type of the
modulation pattern, a match is made between a normalized set of the observed pulses and a sine
wave to obtain a "goodness of fit" based on heuristics. This process is repeated for the other
possible modulation patterns. The match with the highest goodness of fit determines the
modulation type. The already high degree of complexity of this algorithm is only compounded
when additional modulation patterns are included, as the order of complexity of the algorithm
is linear with respect to the number of modulation types. The same holds for the determination
of the probability distribution for the random classification.
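The matching procedure just described can be sketched as follows. This is a simplified illustration, not the CANEWS 2 algorithm: the heuristics used there are not given in the text, so the goodness-of-fit measure (one minus the mean squared error), the template shapes and all function names are assumptions. The sketch also assumes that the observed pulses span exactly one period and start in phase with the templates.

```python
import math


def normalize(values):
    """Scale a sequence of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span > 0 else 0.0 for v in values]


def make_templates(n):
    """Hypothetical one-period modulation templates sampled at n points."""
    t = [i / n for i in range(n)]
    return {
        "sinusoidal": normalize([math.sin(2 * math.pi * x) for x in t]),
        "ramp": normalize(t),
        "triangular": normalize([1 - abs(2 * x - 1) for x in t]),
    }


def goodness_of_fit(observed, template):
    """Assumed measure: 1 - mean squared error (1.0 is a perfect match)."""
    mse = sum((o - m) ** 2 for o, m in zip(observed, template)) / len(observed)
    return 1.0 - mse


def classify_modulation(rf_values):
    """Score every template in turn and keep the best match."""
    observed = normalize(rf_values)
    scores = {name: goodness_of_fit(observed, template)
              for name, template in make_templates(len(observed)).items()}
    return max(scores, key=scores.get), scores


# Example: 32 pulses whose frequency rises linearly over one period.
pulses = [9000 + 50 * i / 32 for i in range(32)]
label, scores = classify_modulation(pulses)
print(label)
```

Because every template is scored in turn, the cost of this approach grows linearly with the number of modulation types, which is the behaviour noted in the text.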

The task of computing the pulse-to-pulse modulation pattern will be used to test the
applicability of neural networks to signal classification.

1.2  Neural  Networks 

The  potential  offered  by  artificial  neural  networks  has  sparked  a  great  deal  of  interest  in 
the  research  community  whose  members  are  eager  to  apply  it  within  their  respective  disciplines. 
Although  traditional  algorithms  have  been  very  successful  at  tasks  that  are  characterized  by 
formal  logical  rules,  they  have  had  little  success  at  tasks  which  are  difficult  to  characterize  this 
way.  These  are  often  tasks  that  the  human  brain  performs  well,  such  as  pattern  recognition  and 
common-sense reasoning. Some of the features of neural networks that are attractive on a
theoretical level are:

Learning: The ability to change behaviour based on sample inputs (and possibly
desired outputs) rather than through program changes.

Generalization: The ability to recognize that an input belongs to a certain class
in spite of noise or distortion.

Abstraction: The ability to extract common features of inputs to create separate
classifications.

These  features  lead  to  a  number  of  practical  advantages: 

Fault tolerance: Some processing capability can exist even if part of the network
is destroyed.

Ease  of  maintenance:  Programming  is  not  required  to  change  the  behaviour  of 
the  network;  adaptation  to  new  variations  of  input  data  is  accomplished 
through  retraining. 

High performance operation: Neural networks are very well suited to parallel
computation. Special hardware with massively parallel architectures is
being designed to exploit this feature.

Ease  of  integration:  With  the  introduction  of  neural  networks  on  chips,  existing 
and  developmental  systems  can  readily  take  advantage  of  this  new 
technology. 

The applications for which neural networks are particularly suited include signal filtering,
pattern recognition, noise removal, classification, data compression, image processing and
auto-associative recall or synthesis. Neural network applications that are currently in use or in
the research phase include: interpreting medical images (for cancerous cells), detecting bombs in
suitcases, controlling the rollers in steel mills, predicting currency exchange rates, and steering
autonomous land vehicles. Perhaps surprisingly, neural networks are very poor at other kinds of
computation such as arithmetic and syllogism (if ... then ... else ... reasoning).

This paper is concerned with classification of radar signals. Other authors have
investigated similar problems using neural networks. Anderson [1] studied the classification of
individual pulses. He used a network to group pulses with similar parameters together to identify
emitters. He makes the interesting point that while many clustering algorithms are known, and
there is no reason to expect a neural network to be any better than the best traditional algorithm,
the fact that neural networks can be both fast and tolerant of noise makes them particularly suited
to his application. Coat and Hill [2] attempted emitter identification by using a neural network
to analyze the inner structure of single pulses. Willson [14] investigated the use of neural
networks to perform deinterleaving (to sort individual pulses into "bins" associated with the radar
emitter that each pulse came from) and perform radar identification based on already classified
pulse parameter information. Brown [3] employed a neural network to classify different
probability distributions of pulsed random agile RF signals.

2.0  THEORY 

The basic component of an artificial neural network is the artificial neuron. By
connecting a number of neurons together, a network is created.

2.1  The  Artificial  Neuron 

An artificial neuron, from here on referred to as a neuron or a node, is represented in
Figure 1.


A neuron receives information from its inputs and passes on information via its outputs.
The inputs to a neuron are supplied either externally, i.e. from data generated outside of the
network, or internally from the output of another neuron in the network. In addition to the N
inputs, a bias B is added, the utility of which will be seen later. Associated with each of the
inputs is a weight which diminishes or reinforces this particular input. The bias has an
associated weight as well. The net input to a neuron is the sum of the N inputs and the bias
adjusted by the weights according to the following formula:

$$ net = \sum_{i=1}^{N} w_i x_i + w_B B $$

where $x_i$ is the i-th input, $w_i$ its weight and $w_B$ the weight associated with the bias.

To produce the final output y, the net input is further processed by a (usually) non-linear
transformation called a transfer function or activation function. The most common transfer
functions (step, ramp, and sigmoid) are represented in Figure 2.

Figure 2: Common Transfer Functions: a) step, b) ramp, c) sigmoid

A displacement of these functions along the x axis can be obtained with the bias
(generally set to 1) multiplied by its weight. Because this product is added to the net input, a
positive weight displaces the function to the left (the node responds at a smaller weighted sum
of its inputs) and a negative weight displaces it to the right.
These three functions can also be defined to be symmetric about the origin, with the
function taking on values between -1 and 1 in the case of the step and the ramp, and between
-½ and ½ in the case of the sigmoid.
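The behaviour of a single neuron can be sketched in a few lines. This is a minimal illustration under stated assumptions: the exact breakpoints of the step and ramp functions in Figure 2 are not recoverable from the text, so the versions below (switching at zero, ramp rising from 0 to 1 over the unit interval) and all function names are chosen for the example only.

```python
import math


def step(x):
    """Binary step: 0 below zero, 1 at or above zero (assumed breakpoint)."""
    return 1.0 if x >= 0 else 0.0


def ramp(x):
    """Linear between 0 and 1, clipped outside that range (assumed shape)."""
    return min(1.0, max(0.0, x))


def sigmoid(x):
    """Logistic sigmoid, bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias_weight, transfer, bias=1.0):
    """Weighted sum of the inputs plus the weighted bias, passed through
    the transfer function."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias_weight * bias
    return transfer(net)


# The same input gives a different step output depending on the bias weight.
print(neuron_output([0.2], [1.0], -0.5, step))  # net = -0.3
print(neuron_output([0.2], [1.0], 0.5, step))   # net = 0.7
```

With a bias of 1 and a bias weight of -0.5, the weighted inputs must reach 0.5 before the step function switches on, which is how the bias moves the transition point of the transfer function.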

Note that any function defined over the input domain that is monotonically increasing
and has a lower and an upper limit can be used as a transfer function. With its binary output,
the step is the simplest of these functions. Finally, some networks modify the behaviour of a
node by generating different outputs based on whether the value generated by the transfer
function exceeds a threshold.

While these transfer functions are the most common, they are not necessarily adequate
for all applications. Some applications use a training algorithm that relies on the differentiability
of the transfer function. The step and the ramp functions, being non-differentiable at the points
of transition, are used in less sophisticated training algorithms.

2.2 Neural Network Types

In the previous section, the structure of the neuron was described. This section deals with
the physical organization of the neurons, and the pattern of connections that exists between them.

The neurons are usually organized in layers. The number of layers in a network varies
from a single layer to multiple layers. In networks containing three or more layers, the middle
layers are called hidden layers as they have no direct connection to the outside of the network.
The first layer of the network, called the input layer, receives data simultaneously at each of its
nodes, hence the notion of an input vector. At the output layer of the network, the outputs are
generated in parallel to yield an output vector.

The  nodes  of  a  network  can  be  connected  in  different  patterns  (see  Figure  3): 

Feed-forward: The flow of information is transmitted from layer to layer from
the input layer to the output layer, with no feedback between layers or
nodes. There are no connections between nodes occurring in the same
layer, so no information is exchanged between nodes in any given layer.

Lateral: Every node is connected to every other node in the network. Recursive
connections are permitted as well (a node can be connected to itself).
Since there are no layers per se, networks of this type are often referred
to as mono-layered networks.

Feedforward/Feedbackward: During the learning or training phase, information
is transmitted from one layer to the next in both directions (from the input
layer to the output layer and vice-versa). The number of weights is
therefore double that of feed-forward networks.

Cooperative/Competitive: This is a combination of the previous two types, i.e.,
the connections between nodes in adjacent layers can be in both directions
as well as between the nodes in any given layer.


Figure  3:  Network  Types  (from  Handbook  of 
Neural  Computing  [10]) 


For certain problems, a single network may either be inadequate or ineffective in
producing a satisfactory solution. In current work, a common approach is to break a problem
into smaller tasks which are processed by connecting several networks together to work serially
(in a cascade) or in parallel.

2.3 Training

The training process of a neural network is carried out by presenting a sequence of
training examples or input vectors in a random order to the network. The weights in the network
are updated after each presentation according to the network’s learning algorithm. This process
is repeated until a pre-determined criterion is satisfied. For example, the criterion may be the
minimization of an error function, where the error represents the difference between the actual
outputs and the expected outputs. The goal upon finishing the training period is to have the
network produce the anticipated result for any input not contained in the set of training vectors.
In the case of supervised training, a pair of vectors is presented to the network: an input vector
and a corresponding output vector. When the network is presented with an input vector, the
generated output should match the corresponding output vector. If the input and output training
vectors are the same, the training process is said to be auto-associative, otherwise it is said to be
hetero-associative. There is also an intermediate form of training called reinforcement in which
the network is simply informed if the output is correct or not. In the case of unsupervised
training, only the input vector is supplied.

In contrast to other systems, a characteristic attribute of neural networks is their ability
to store information in the form of weights on the connections. Before training, the values of
the weights are random, but as training proceeds, the network stores the information by changing
the values of its weights. This information is distributed throughout the set of nodes. The
weights therefore represent the network’s current state of knowledge. Each individual training
vector influences the entire set of weights and each individual weight depends on the entire set
of training vectors.

2.3.1 Training Algorithms

A training algorithm specifies the way in which weights in the network are to be updated
in order to improve the network’s performance. In the case of supervised training, the algorithm
is usually a variation of one of the following three types [7]:

Hebbian learning: The weight of an input connection to a node is increased if
the value associated with the input connection and the value of the node’s
output are both large. This algorithm is designed to simulate the
phenomenon of learning through repetition and the formation of habits.

Delta rule learning: The input connection weights of a neuron are adjusted so
as to reduce the error between the actual output and the desired output of
the neuron.

Competitive learning: The nodes in a layer compete amongst themselves. The
node that generates the greatest output for a given input modifies its
weights to generate an even greater output. The other nodes in the layer
modify their weights to decrease their output.
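The three update rules can be illustrated for nodes whose output is simply the weighted sum of their inputs. This is a sketch in simplified textbook form, not the formulation of [7]: the linear node, the learning rate `rate` and all function names are assumptions made for the example.

```python
def dot(weights, inputs):
    """Output of a linear node: the weighted sum of its inputs."""
    return sum(w * x for w, x in zip(weights, inputs))


def hebbian_update(weights, inputs, output, rate=0.1):
    """Strengthen each weight in proportion to input * output."""
    return [w + rate * x * output for w, x in zip(weights, inputs)]


def delta_rule_update(weights, inputs, output, target, rate=0.1):
    """Move each weight so as to reduce the error (target - output)."""
    error = target - output
    return [w + rate * error * x for w, x in zip(weights, inputs)]


def competitive_update(layer_weights, inputs, rate=0.1):
    """The node with the greatest output moves its weights toward the input
    (raising its output); every other node moves away (lowering its output)."""
    outputs = [dot(w, inputs) for w in layer_weights]
    winner = outputs.index(max(outputs))
    updated = []
    for k, weights in enumerate(layer_weights):
        sign = 1.0 if k == winner else -1.0
        updated.append([w + sign * rate * x for w, x in zip(weights, inputs)])
    return updated


x = [1.0, 0.5]
w = [0.2, 0.4]
print(dot(hebbian_update(w, x, dot(w, x)), x))
print(dot(delta_rule_update(w, x, dot(w, x), target=1.0), x))
print(competitive_update([[0.9, 0.1], [0.1, 0.2]], x))
```

In each case a single presentation changes the weights only slightly; the rules take effect through repeated presentation of the training vectors.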

Unsupervised learning is less commonly used. The basic idea is that the network
organizes itself so that each output node responds (generates a large output) to a set of input
vectors possessing a particular characteristic.

2.4 Summary

In this chapter, the micro-structure, the macro-structure and the training algorithms of
artificial neural networks have been described. In a continuous attempt to improve the
performance of neural networks, researchers have developed a wide variety of networks: the
Perceptron, the Madaline, the Adaline, Brain-State-in-a-Box, Hopfield nets, back-propagation
networks, self-organizing networks, Boltzmann machines, and networks that apply adaptive
resonance theory, to name but a few. Among this vast choice of networks, back-propagation
networks are without doubt the most popular and the most commonly used. This is due to the
simplicity of their training algorithms and their effectiveness in handling an ever growing number
of practical applications. Training with feedback (back-propagation) will be the subject of
discussion in the following chapter.

3.0  BACK-PROPAGATION 

3.1  Background 

Back-propagation (or feedback) training is used in networks containing multiple layers:
an input layer, an output layer, and one or two hidden layers. More than two hidden layers could
be used if desired, but it has been proven that no advantage is to be gained in doing so [10]. The
nodes in the input layer have a linear transfer function which distributes the input values to all
of the nodes in the next layer. For all subsequent layers, including the output layer, the transfer
functions of the nodes are non-linear and generally sigmoidal in nature. (As will be seen later,
back-propagation learning relies on the uniform differentiability of the transfer function. The
sigmoid function satisfies this constraint.) The connections are strictly feed-forward and the
training is supervised.

The aim of the back-propagation learning rule (often referred to as the Generalized Delta
Rule) is to minimize the difference between the actual output and the desired output of a
network for the entire set of input training vectors. To accomplish this, training vectors are
presented one by one to the network, and for each one, the error is minimized by adjusting all
of the weights in the network. In effect, all of the nodes, including their input weights, are
jointly responsible for the global error. The adjustment is based on the gradient method (a
method commonly used in optimization problems). This method consists of stepping along the
error function to be minimized in the opposite direction of the gradient, i.e., in the direction
where the value of the function decreases. For non-linear problems, as is the case here, the
method must be used iteratively until convergence is attained. The number of iterations, or
presentations, necessary before convergence to a minimum point is obtained is extremely variable,
being typically anywhere between 100 and 10000. The training period required using the
back-propagation technique is in general a lengthy one, which is the principal weakness of the
technique. Another problem inherent in this algorithm is that there is no guarantee that the
minimum value found is indeed the global minimum; in fact, it is often just a local minimum.
Furthermore, there is no guarantee that the global minimum of the error function is actually zero.
It is incumbent upon the user, therefore, to verify that the minimum found is satisfactory for the
particular application.
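The dependence of the gradient method on its starting point can be seen on a one-dimensional example. The error function below is hypothetical, chosen only because it has one local and one global minimum and is never zero; the step size `rate` and the stopping tolerance `tol` are likewise assumptions of this sketch.

```python
def error(x):
    """Hypothetical error surface with one local and one global minimum."""
    return x**4 - 3 * x**2 + x + 4


def error_gradient(x):
    """Derivative of error() with respect to x."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(start, rate=0.01, tol=1e-8, max_iters=10000):
    """Step against the gradient until the step becomes negligible."""
    x = start
    for iteration in range(1, max_iters + 1):
        step = rate * error_gradient(x)
        x -= step
        if abs(step) < tol:
            break
    return x, iteration


x_a, iters_a = gradient_descent(start=2.0)    # settles in the local minimum
x_b, iters_b = gradient_descent(start=-2.0)   # settles in the global minimum
print(x_a, error(x_a), iters_a)
print(x_b, error(x_b), iters_b)
```

Both runs converge, but only the second reaches the lower of the two minima, and even there the error remains above zero. In a network the starting point corresponds to the random initial weights, so the quality of the minimum found has to be checked after training.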

Once the training process is terminated, the performance of the network is evaluated with
a set of test cases. This set of test vectors must be completely disjoint from the set of training
vectors, as the network may provide excellent results for the training vectors but yield mediocre
results for other vectors. This would indicate that the network has only learned specific cases,
rather than acquiring the ability to generalize (interpolation for new input vectors becomes
impossible). This phenomenon can arise if the network is over-trained.

3.2 Output Layer Learning Rule

The goal of back-propagation learning is to adjust the weights in the network so as to
minimize the error function. This is done by comparing the output vector with the desired output
vector and adjusting the weights, starting with the output nodes and working backwards to the
input nodes. In a network with N inputs and L hidden nodes arranged in a single layer and M
outputs, the error to be minimized for each training vector is the sum of the squares of the
errors at the outputs. Therefore, the global error for a training vector p is given by:

$$ E_p = \frac{1}{2} \sum_{k=1}^{M} \delta_{pk}^2 $$

where $\delta_{pk} = y_{pk} - o_{pk}$ is the error for the vector p at the output node k between the
expected output value $y_{pk}$ and the actual output value $o_{pk}$.

To determine the relative responsibility and the direction in which each weight will be
adjusted, the gradient of the error $\nabla E_p$ with respect to the weights $w^o_{kj}$ (weights associated
with the connection between node j and node k) is calculated. Each weight in the output layer is
adjusted proportionally in the negative direction of the partial derivative of the error with
respect to this weight:

$$ \frac{\partial E_p}{\partial w^o_{kj}} = -(y_{pk} - o_{pk}) \, f^{o\prime}_k(net^o_{pk}) \, i_{pj} \tag{3} $$

$$ net^o_{pk} = \sum_{j=1}^{L} w^o_{kj} \, i_{pj} + \theta^o_k $$

where $net^o_{pk}$ is the net input, $f^{o\prime}_k$ the derivative of the transfer function, $\theta^o_k$ the bias of
the output node k and $i_{pj}$ the output of the node j in the hidden layer. Note that because the
partial derivatives make use of the derivative of the transfer function, this function must be
differentiable for all values. The new weight is obtained by adding an amount proportional to the
negative of the partial derivative to the former weight. With a constant of proportionality $\eta$,
the adjusted weight is obtained by using the following formula (see [4] for the complete derivation):

$$ w^o_{kj}(t+1) = w^o_{kj}(t) + \eta \, (y_{pk} - o_{pk}) \, f^{o\prime}_k(net^o_{pk}) \, i_{pj} \tag{4} $$

Finally, by defining a value $\delta^o_{pk} = (y_{pk} - o_{pk}) \, f^{o\prime}_k(net^o_{pk})$, one obtains

$$ w^o_{kj}(t+1) = w^o_{kj}(t) + \eta \, \delta^o_{pk} \, i_{pj} \tag{5} $$
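The output-layer rule can be checked numerically for a single output node. The sketch below assumes a sigmoid transfer function, for which the derivative is o(1 - o), and an error equal to half the squared output error; both choices, and all function names, are assumptions of this example. It compares the analytic partial derivatives with a finite-difference estimate and then applies one weight update.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def node_output(w, theta, hidden_out):
    """Output of one sigmoid output node fed by the hidden-layer outputs."""
    return sigmoid(sum(wj * ij for wj, ij in zip(w, hidden_out)) + theta)


def output_error(w, theta, hidden_out, target):
    """Error for a single output node: half the squared output error."""
    return 0.5 * (target - node_output(w, theta, hidden_out)) ** 2


def analytic_gradient(w, theta, hidden_out, target):
    """dE/dw_j = -(y - o) * f'(net) * i_j, with f' = o * (1 - o)."""
    o = node_output(w, theta, hidden_out)
    return [-(target - o) * o * (1.0 - o) * ij for ij in hidden_out]


def numeric_gradient(w, theta, hidden_out, target, h=1e-6):
    """Central finite-difference estimate of the same partial derivatives."""
    grads = []
    for j in range(len(w)):
        up = w[:j] + [w[j] + h] + w[j + 1:]
        down = w[:j] + [w[j] - h] + w[j + 1:]
        grads.append((output_error(up, theta, hidden_out, target)
                      - output_error(down, theta, hidden_out, target)) / (2 * h))
    return grads


def updated_weights(w, theta, hidden_out, target, eta=0.5):
    """One update step: each weight moves against its partial derivative."""
    gradient = analytic_gradient(w, theta, hidden_out, target)
    return [wj - eta * gj for wj, gj in zip(w, gradient)]


w, theta, hidden_out, target = [0.1, -0.2], 0.05, [0.6, 0.9], 1.0
print(analytic_gradient(w, theta, hidden_out, target))
print(numeric_gradient(w, theta, hidden_out, target))
new_w = updated_weights(w, theta, hidden_out, target)
print(output_error(w, theta, hidden_out, target),
      output_error(new_w, theta, hidden_out, target))
```

The two gradient estimates agree closely, and the error after the update is smaller than before, as expected for a sufficiently small constant of proportionality.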


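3.3 Hidden Layer Learning Rule

As the error also depends on the weights associated with the hidden layers of the network,
similar calculations to those done for the weights associated with the output layer must be done
for these weights as well. The error for the hidden layer can be expressed as a function of
the outputs $i_{pj}$ of the hidden layer:

$$ E_p = \frac{1}{2} \sum_{k=1}^{M} (y_{pk} - o_{pk})^2 = \frac{1}{2} \sum_{k=1}^{M} \Big( y_{pk} - f^o_k\Big( \sum_{j=1}^{L} w^o_{kj} \, i_{pj} + \theta^o_k \Big) \Big)^2 $$
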
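Using this equation to calculate the gradient with respect to the weights $w^h_{ji}$ in the hidden
layer, the following equation is obtained:

$$ \frac{\partial E_p}{\partial w^h_{ji}} = - f^{h\prime}_j(net^h_{pj}) \, x_{pi} \sum_{k=1}^{M} (y_{pk} - o_{pk}) \, f^{o\prime}_k(net^o_{pk}) \, w^o_{kj} $$

where $x_{pi}$ is the i-th input of the training vector p and $net^h_{pj}$ the net input of the hidden node j.

If an adjustment proportional to the negative of the partial derivatives, with a constant of
proportionality equal to $\eta$, is made, then the value of the adjusted weight is given by (see [4]
for the complete derivation):

$$ \begin{aligned} w^h_{ji}(t+1) &= w^h_{ji}(t) + \eta \, f^{h\prime}_j(net^h_{pj}) \, x_{pi} \sum_{k=1}^{M} (y_{pk} - o_{pk}) \, f^{o\prime}_k(net^o_{pk}) \, w^o_{kj} \\ &= w^h_{ji}(t) + \eta \, f^{h\prime}_j(net^h_{pj}) \, x_{pi} \sum_{k=1}^{M} \delta^o_{pk} \, w^o_{kj} \end{aligned} \tag{10} $$

By defining $\delta^h_{pj} = f^{h\prime}_j(net^h_{pj}) \sum_{k=1}^{M} \delta^o_{pk} \, w^o_{kj}$, one obtains

$$ w^h_{ji}(t+1) = w^h_{ji}(t) + \eta \, \delta^h_{pj} \, x_{pi} \tag{11} $$

Notice that the new $\delta^h_{pj}$ depends on all of the $\delta^o_{pk}$ of the output layer.
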
In  the  case  of  a 
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(11) 


network  contnining  more  than  one  hidden  layer,  the  equation  would  be  similar,  except  that  the 
new  5  would  dq)end  on  the  of  the  second  hidden  layer  instead  of  the  ouq)ut  layer.  This  is 
why  the  adjustmrat  of  the  weights  begins  at  die  output  layer,  tlien  continues  from  layer  to  layer 
in  die  o^xisitB  direction  from  the  usual  flow  of  information  (hence  the  term  back-propagation). 
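
The output layer and hidden layer learning rules can be sketched for a network with a single
hidden layer as follows. This is only a minimal illustration, assuming sigmoid transfer
functions in both layers, a weight update after every training pair, and biases treated as
weights on a constant input of one. The names `train_step`, `make_network`, `w_h`, `w_o`, `b_h`
and `b_o` are hypothetical and do not come from the report.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_h, b_h, w_o, b_o):
    """Return the hidden outputs i_j and the network outputs o_k for input x."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_h, b_h)]
    out = [sigmoid(sum(w * ij for w, ij in zip(row, hidden)) + b)
           for row, b in zip(w_o, b_o)]
    return hidden, out

def train_step(x, y, w_h, b_h, w_o, b_o, eta=0.5):
    """One back-propagation update (in place); returns the error before the update."""
    hidden, out = forward(x, w_h, b_h, w_o, b_o)
    # Output layer: delta_k = (y_k - o_k) f'(net_k); for the sigmoid, f' = o (1 - o).
    d_out = [(yk - ok) * ok * (1.0 - ok) for yk, ok in zip(y, out)]
    # Hidden layer: delta_j = f'(net_j) * sum_k delta_k w_kj, using the old output weights.
    d_hid = [ij * (1.0 - ij) * sum(d_out[k] * w_o[k][j] for k in range(len(w_o)))
             for j, ij in enumerate(hidden)]
    # Adjust the output layer first, then move back towards the input layer.
    for k, dk in enumerate(d_out):
        for j, ij in enumerate(hidden):
            w_o[k][j] += eta * dk * ij
        b_o[k] += eta * dk
    for j, dj in enumerate(d_hid):
        for i, xi in enumerate(x):
            w_h[j][i] += eta * dj * xi
        b_h[j] += eta * dj
    return 0.5 * sum((yk - ok) ** 2 for yk, ok in zip(y, out))

def make_network(n_in, n_hid, n_out, seed=0):
    """Small random weights for an n_in - n_hid - n_out network (illustrative helper)."""
    rng = random.Random(seed)
    def rand_row(n):
        return [rng.uniform(-0.5, 0.5) for _ in range(n)]
    return ([rand_row(n_in) for _ in range(n_hid)], rand_row(n_hid),
            [rand_row(n_hid) for _ in range(n_out)], rand_row(n_out))
```

Calling `train_step` repeatedly on the same training pair should lower the returned error,
since each call moves the weights a small step against the gradient for that pair.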

3.4  Internal  Behaviour 

In the process of iterative training, the nodes in the hidden layers detect different
characteristics found in the input vector, in the sense that their outputs are activated (output
value is large) if they detect a particular characteristic and are deactivated (output value is
small) otherwise. Since there are usually fewer hidden nodes than input nodes, this phase of
training is comparable to a reduction in the number of dimensions of the input space. If the
network is used for the purposes of classification, the nodes in the output layer arrange
themselves to partition the output space into separate regions based on the characteristics
supplied by the hidden nodes. In general, each separate region represents a distinct class of
the inputs.

When training is finished, the test vectors are presented to the network. The network
interprets these new vectors by detecting the absence or presence of characteristics already seen
in the training vectors. If a good generalization has been obtained, and a test vector shares
common characteristics with a class of training vectors, then the network provides excellent
results. The network can respond to new test vectors by interpolating a response from the set
of training vectors. For example, if the network generates an output A' from a training vector
A and an output C' from a training vector C, then, if B is "between" A and C, the network will
generate B' between A' and C'. The network does not in general provide good results in the
context of extrapolation: it does not learn features when it has never been given examples
containing these features. Note that an addition of noise is considered to be an interpolation
and not an extrapolation, as the prime characteristics are still present and the noise only adds
superfluous information that is removed by the network.

3.5  Variations of Back-propagation Training

One of the principal drawbacks of the back-propagation technique is its slow training
time. One of the ways to accelerate learning is to use a large learning coefficient (the
value $\eta$ in equation 6 in section 3.2 and in equation 11 in section 3.3). A large learning
coefficient, in theory, permits faster training times, since the steps taken in the opposite
direction of the gradient are greater. This has the potential drawback, however, of causing
oscillation of the total error when the curvature of the error surface is great. As the
algorithm assumes that the surface is locally linear, the steps taken will continuously overstep
the global minimum. A small learning coefficient stabilizes this process, but results in very
long training periods. In addition, it increases the likelihood that the algorithm will become
trapped in a local minimum. To alleviate these problems, variations of the original algorithm
have been developed.

3.5.1  Varying the Coefficients

One can accelerate the training process by varying the value of the learning coefficient.
By starting with a large value for $\eta$, large steps can be taken to quickly move towards the
minimum of the error function and to avoid being trapped in a local minimum. As training
progresses, the value of $\eta$ can be lowered to prevent oscillations about the global minimum.

3.5.2  The Momentum Method

A common method to speed up learning is the momentum method. As its name indicates,
the momentum method consists of adding momentum in the direction of the weight change. In
calculating a new adjustment to the weights, it takes into account the previous adjustment.
Therefore a fraction $\alpha$ of the previous step is added to the formula to produce the
following equation:

$$w^o_{kj}(t+1) = w^o_{kj}(t) + \eta\,\delta^o_k\,i_j + \alpha\,[w^o_{kj}(t) - w^o_{kj}(t-1)]$$

where $\alpha$ is between 0 and 1. The new term acts as a low-pass filter on the weight error and
reduces the oscillation of the network's global error, because it favours a continuation of
movements in the same direction. With this method, faster rates of learning are obtained, while
keeping the learning coefficient $\eta$ small at the same time.
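
The low-pass behaviour of the momentum term can be seen in a short sketch. It assumes a single
weight and a caller-supplied gradient step (the term $\eta\,\delta\,i$); the function name
`momentum_update` and the value 0.9 for the momentum fraction are illustrative choices only.

```python
def momentum_update(w, grad_step, prev_delta, alpha=0.9):
    """Return (new weight, adjustment), where the adjustment adds a fraction
    alpha of the previous adjustment to the plain gradient step."""
    delta = grad_step + alpha * prev_delta
    return w + delta, delta

# Steps that keep the same sign reinforce each other ...
w, delta = 0.0, 0.0
for _ in range(100):
    w, delta = momentum_update(w, 0.1, delta)
print("same direction:", round(delta, 4))

# ... while steps that alternate in sign largely cancel.
w, delta = 0.0, 0.0
for n in range(100):
    step = 0.1 if n % 2 == 0 else -0.1
    w, delta = momentum_update(w, step, delta)
print("alternating:   ", round(abs(delta), 4))
```

For a constant gradient step the adjustment grows towards 1/(1 - alpha) times the plain step,
whereas for a step that alternates in sign it shrinks towards 1/(1 + alpha) times the plain
step, which is the damping of oscillations described above.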

3.5.3  Cumulative Update of Weights

Cumulative back-propagation is another method that can improve the rate of convergence
of the algorithm. It consists of accumulating the error for several pairs of input/output
training vectors before updating the weights. When a single update is made, the error function
is reduced for that particular pair only. The overall error function may actually increase. A
global update, on the other hand, guarantees a reduction in the overall error function. Caution
must be exercised though; the number of calculations greatly increases with the number of
accumulated training pairs. The benefits of this technique are lost if the number of pairs is
too large.
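
A minimal sketch of the cumulative update is given below. To keep it short it assumes a single
linear node rather than a full network; `cumulative_update` and `linear_step` are hypothetical
names, and `linear_step` stands in for whatever per-pair adjustment the network would compute.

```python
def linear_step(weights, x, y, eta=0.5):
    """Per-pair adjustment eta * (y - output) * x_i for a single linear node."""
    err = y - sum(w * xi for w, xi in zip(weights, x))
    return [eta * err * xi for xi in x]

def cumulative_update(weights, pairs, step_fn):
    """Accumulate the adjustments for all pairs, then update the weights once."""
    total = [0.0] * len(weights)
    for x, y in pairs:
        # Every adjustment is computed from the same, not yet updated, weights.
        step = step_fn(weights, x, y)
        total = [t + s for t, s in zip(total, step)]
    return [w + t for w, t in zip(weights, total)]

pairs = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
print(cumulative_update([0.0, 0.0], pairs, linear_step))
```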

3.6  Summary 

This chapter has described the back-propagation learning rule, its internal behaviour, and
three methods for accelerating the training process. The next chapter covers a number of
practical considerations that all neural network builders must deal with when constructing
networks to effectively solve concrete problems.

4.0  PRACTICAL  CONSIDERATIONS 

Two  important  practical  concerns  in  the  design  of  a  network  are: 

Convergence of the error function: A problem that can occur during the
training period and prevent convergence is node saturation. For
transfer functions such as the sigmoid that attain their minimum and
maximum values asymptotically, a large input at a node will be mapped
into a region of the sigmoid whose derivative is virtually zero. If the
derivative is zero, the learning adjustment will be zero, which implies that
the node ceases to learn. When this happens, the node is said to be
saturated. If too many nodes become saturated during training, learning
could simply stop before a minimum in the error function can be found.
The speed of convergence is also an important consideration. Some
techniques for accelerating the training period for back-propagation were
described in the previous chapter.

Network performance: It is important that the network perform efficiently and
effectively on real inputs when the training period is finished.
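
Node saturation, described in the first item above, can be illustrated numerically. The sketch
below assumes the logistic sigmoid as the transfer function; the function names are
illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Derivative of the logistic sigmoid, f'(x) = f(x) (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative, and with it the weight adjustment, collapses as the net input grows.
for net in (0.0, 2.0, 5.0, 10.0):
    print(f"net input {net:5.1f}   derivative {sigmoid_derivative(net):.6f}")
```

The derivative is largest (0.25) at a net input of zero and is already below 0.0001 at a net
input of 10, so a node driven that far into the asymptotic region hardly changes its weights.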

There  are  a  variety  of  techniques  that  address  these  network  design  issues.  These  are 
described  in  the  following  sections. 

4.1  Pre-treatment of Input Data

Raw data is rarely presented to a network without some form of pre-treatment. In certain
cases, to ensure convergence of the error function, a simple normalization of the input and
output vectors is performed to map the data values to within a narrow zone (outside of the
asymptotic regions of the transfer function). For example, input values should be normalized to
lie within the range [-1, 1] and the output training values to lie within [0, 1]. Freeman and
Skapura [4] suggest an additional precaution when the transfer function of the output nodes is
sigmoidal: normalize the output training values to [0.1, 0.9] rather than [0.0, 1.0] to avoid
saturation of the nodes, since the values of 0.0 and 1.0 can only be obtained asymptotically at
the output of this transfer function.
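
The normalization described above amounts to a linear rescaling. The following sketch assumes
a simple min-max mapping over one vector; the function name `rescale` is illustrative, and the
choice of what to do with a constant vector (map it to the middle of the range) is an
assumption, not something the report specifies.

```python
def rescale(values, lo, hi):
    """Map values linearly so that min(values) -> lo and max(values) -> hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant vector carries no variation; place it mid-range (assumption).
        return [(lo + hi) / 2.0 for _ in values]
    scale = (hi - lo) / (vmax - vmin)
    return [lo + (v - vmin) * scale for v in values]

inputs = rescale([2.0, 4.0, 6.0], -1.0, 1.0)    # network inputs in [-1, 1]
targets = rescale([0.0, 1.0, 0.0], 0.1, 0.9)    # sigmoid targets in [0.1, 0.9]
print(inputs, targets)
```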

In other cases, the pre-treatment of the input data requires more attention. A
transformation may be necessary to eliminate insignificant variations and superfluous details
(translations, rotations, deformities, ...) while at the same time accentuating the pertinent
information. Without doubt, the most commonly used transformation is the Fourier transform,
but other transformations such as the Cepstrum, the Gabor transform and the spectrogram are
used as well. Recently, Principal Component Analysis (PCA) has received attention [10]. This
technique reduces the number of dimensions of a problem, but keeps the parameters for which
the eigenvectors of the autocorrelation matrix have the greatest energy. When the dynamic
range of a variable goes beyond several octaves, a logarithmic transformation may be necessary
to retain the small variations that might otherwise be lost in normalization. There are other
techniques such as Feature Extraction in which only the important characteristics of the raw
data are presented to the network as inputs. This reduces the work and complexity of the network
but one must ensure the relevance of the chosen characteristics. In general, it is difficult to
know in advance which transformations will produce the best results. Experimentation is often
required to determine the best approach.

If  the  network  is  to  operate  in  a  noisy  environment,  then  once  the  inputs  have  been 
transformed  and  normalized  to  fit  the  application,  appropriate  noise  must  be  added  to  any 
synthetically  generated  training  vectors.  Freeman  and  Skapura  [4]  have  found  that  noise  added 
to  the  inputs  can  facilitate  the  convergence  of  the  network  even  if  the  network  is  destined  to 
operate in a clean environment.

4.2  Weight Initialization and Adjustment

Before  training  begins,  the  weights  in  the  network  are  randomly  initialized.  It  is 
recommended  that  values  of  the  weights  lie  between  ±0.5  (some  authors  recommend  even  lower 
limits).  Doing  so  prevents  the  neurons  from  saturating  at  the  outset  of  training.  Once  training 
is  in  progress,  it  is  important  that  the  training  vectors  be  presented  in  a  random  order  so  that  the 
network learns the characteristics of the inputs rather than a specific order of inputs. Of even
greater  importance  is  to  avoid  presenting  all  of  the  vectors  of  one  class,  followed  by  the  vectors 
of  the  second,  and  so  on.  There  is  a  risk  that  the  network  will  forget  what  it  has  learned  from 
the  preceding  class  during  the  training  of  the  next  class. 
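
Both recommendations, small random initial weights and a random presentation order, can be
sketched as follows. The helper names `init_weights` and `shuffled_epochs` are hypothetical;
the limit of 0.5 is the value quoted above.

```python
import random

def init_weights(n_rows, n_cols, limit=0.5, rng=random):
    """Weight matrix with entries drawn uniformly from [-limit, +limit]."""
    return [[rng.uniform(-limit, limit) for _ in range(n_cols)]
            for _ in range(n_rows)]

def shuffled_epochs(training_pairs, n_epochs, rng=random):
    """Yield every training pair once per pass, in a fresh random order each pass,
    so that vectors of one class are never presented as a single block."""
    for _ in range(n_epochs):
        order = list(training_pairs)
        rng.shuffle(order)
        for pair in order:
            yield pair

rng = random.Random(42)
hidden_weights = init_weights(5, 16, rng=rng)   # e.g. 16 inputs feeding 5 hidden nodes
pairs = [(k, k % 5) for k in range(10)]         # placeholder (vector, class) pairs
presented = list(shuffled_epochs(pairs, 2, rng=rng))
print(len(hidden_weights), len(presented))
```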

Another method of preventing saturation in the nodes is to add a small quantity $\epsilon$ to
the derivative of the transfer function. As the adjustment of the weights depends on the
derivative, this $\epsilon$ permits a slight adjustment to nodes that would otherwise cease to
learn.

4.3  Choosing the Number of Training Vectors

Choosing the correct number of training vectors is an important consideration. When
there are not enough vectors in the training set, there is a risk that the network will fail to
generalize. The network will produce correct results for vectors in the training set but will
fail on test vectors. According to the Baum and Haussler result [6], if the allowable error for
any test vector is to be less than $\epsilon$, then the network will "almost certainly"
generalize if the allowable error for the training vectors is less than $\epsilon/2$ and the
number of training vectors is greater than $W/\epsilon$, where $W$ equals the number of weights
in the network. For example, if the test vectors are to have errors less than 0.2 and there are
200 connections in the network, then there should be at least 1000 training vectors and the
training should proceed until the value of the error function is less than 0.1.
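
The arithmetic of this rule of thumb can be written out as a small helper. The function name
`baum_haussler` is illustrative; exact fractions are used only to avoid floating-point rounding
when dividing by values such as 0.2.

```python
import math
from fractions import Fraction

def baum_haussler(n_weights, epsilon):
    """Return (minimum number of training vectors, training error target)
    for a desired test error epsilon, i.e. (W / epsilon, epsilon / 2)."""
    eps = Fraction(str(epsilon))
    return math.ceil(Fraction(n_weights) / eps), float(eps / 2)

print(baum_haussler(200, 0.2))   # the example from the text: (1000, 0.1)
```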

4.4  Network  Design 

The decision concerning the number of layers and nodes to use for a particular application
is of great importance as it influences the likelihood that the error function converges and that
the network generalizes. While there are heuristics to guide these choices, experimentation is
often required to achieve acceptable results. This experimental aspect is one of the prime sources
of criticism directed towards the neural network approach.

4.4.1  Number  of  Layers 

Because the computational cost of the network grows with the number of layers and nodes,
it is desirable to limit these values as much as possible. It has been proposed that in most
cases one or two hidden layers is sufficient. The neural network interpretation of Kolmogorov's
theorem by Hecht-Nielsen proves that a network with a single hidden layer, whose nodes have
transfer functions that are not constrained to be the same, can represent an arbitrary function
of the inputs [10]. Unfortunately, the theorem does not say how to choose the transfer functions
and has therefore little practical application. Cybenko shows that two hidden layers, whose
nodes all use sigmoidal transfer functions, are sufficient to compute an arbitrary function of
the inputs, and that a single hidden layer is sufficient for classification problems [10].
According to Maren et al. [10], experimentation has confirmed these results.

4.4.2  Number of Nodes Per Layer

The number of nodes in the input layer is fixed by the number of points in the input
vector. For the output layer, the number of nodes is set to the number of desired outputs (e.g.
classes, categories, probabilities). Any attempt to reduce the number of nodes in the output layer
by an encoding of the outputs should be applied with caution, as the additional burden imposed
on the network may only prove workable with the inclusion of a second hidden layer.

Choosing the number of nodes in the hidden layers is more difficult. The objective is to
keep the number of nodes to a minimum, as the addition of extra nodes can result in a network
that learns particular cases yet fails to learn general characteristics, and also increases the
training and operational complexity. Here again, the decision can be guided by theoretical and
empirical limits. According to Hecht-Nielsen's interpretation of the Kolmogorov theorem, 2N + 1
nodes (where N is the number of inputs) are necessary to calculate an arbitrary function.
According to Kudrycki, the optimum ratio between the first and the second hidden layer is three
to one [10]. For small networks where the number of inputs is greater than the number of
outputs, it has been proposed that the geometric mean of the number of inputs and the number of
outputs provides a good estimate of the optimum number of nodes to use. There is general
agreement, however, that the number of nodes in each hidden layer should be fewer than the
number of inputs.
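
Two of these heuristics reduce to one-line formulas, sketched below with illustrative function
names. The Kolmogorov-based count is taken here as 2N + 1, the usual statement of
Hecht-Nielsen's result, and rounding the geometric mean to the nearest integer is an assumption.

```python
import math

def kolmogorov_hidden_nodes(n_inputs):
    """Hidden node count from Hecht-Nielsen's reading of Kolmogorov's theorem."""
    return 2 * n_inputs + 1

def geometric_mean_hidden_nodes(n_inputs, n_outputs):
    """Geometric mean of the input and output counts, rounded to an integer."""
    return round(math.sqrt(n_inputs * n_outputs))

# A network with 16 inputs and 5 outputs:
print(kolmogorov_hidden_nodes(16), geometric_mean_hidden_nodes(16, 5))
```

For 16 inputs and 5 outputs the two rules give 33 and 9 nodes respectively, which shows how far
apart the theoretical and empirical guidelines can be.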

4.5  Summary

This chapter has dealt with practical considerations in the design of a network. The two
main issues are convergence of the error function and network performance. A number of
techniques that address these issues were presented, including pre-treatment of the input data,
weight initialization and adjustment, the number of vectors in the training set, and the choice of
the number of layers and nodes in the network.

5.0  CLASSIFICATION  OF  RADAR  SIGNALS 

The advent of complex radar systems has led to the appearance of frequency and pulse
repetition interval (PRI) agile emitters. Emitters can vary these parameters, either randomly or
periodically following a well defined modulation pattern. The application of neural networks
described in this paper is limited to the common modulation patterns shown in Figure 4. (The
vertical and horizontal axes represent the frequency of the pulse and the time of arrival
respectively.) These include the sinusoid, the triangular wave, the sawtooth wave (including both
negative and positive sloping teeth) and the rectangular wave. A signal exhibiting a constant
pulse frequency or PRI could also be viewed as having a distinct modulation pattern, but there
are already simple algorithms to detect such signals.

Both  pulses  and  modulation  patterns  are  measured  in  frequency.  The  frequency  of  a  pulse 
is  typically  measured  in  megahertz  or  gigahertz,  while  the  frequency  of  the  modulation  pattern 
will  be  in  the  order  of  millihertz  or  even  hertz.  The  use  of  the  term  signal  refers  to  the  stream 
of  pulses  received  by  an  ESM  system. 

This chapter describes the results of a number of experiments that were devised to
determine the applicability of neural networks to the recognition and classification of such
modulation patterns, as well as to assess the performance of different network architectures and
the effect of treating the data by transforming it from the time domain to the frequency domain.
In each experiment, a network is presented with a number of training vectors, followed by a
number of test vectors. Each vector represents a stream of pulses received in one illumination.
All network inputs, both training vectors and test vectors, are computer generated synthetic data.


[Figure 4, panel b): triangular wave; vertical axis from -1 to 1]

Figure 4: Signal Modulation Patterns


5.1  Time  Domain  Experiments 

The goal of a time domain network is to classify an input test vector as one of the
following modulation pattern types: sine, triangular, negative sloping sawtooth, positive sloping
sawtooth or square. Two assumptions are made in generating the data. First, it is assumed that
all radar pulses seen in an illumination are equidistant in time, i.e. constant PRI, and second,
it is assumed that the signal average is zero, i.e. no DC component is added to the signal
generation equations. (For real data, a pre-treatment step to subtract the average value,
assuming it is non-zero, is necessary.) It is also assumed that the parameters vary as follows:

Time: The length in time of an illumination is variable. In other words, an
illumination may contain a variable number of pulses.

Frequency: The frequency of the modulation pattern can also be viewed as a
variation in the number of periods contained in one illumination. The
number is assumed to be relatively small, varying between one period and
three periods.

Phase: An illumination can start at any point in the cycle of the modulation
pattern; the phase shift varying anywhere between 0 and $2\pi$.

The  classifier  must  find  shared  characteristics  in  the  inputs,  subject  to  the  above  parameter 
variations,  that  permit  it  to  differentiate  between  different  signal  patterns. 

All of the networks used in the experiments are of the back-propagation type, all have a
single hidden layer, and all use the cumulative update training method. The elements of an input
vector represent the pulse frequencies in a single illumination. The output vector contains one
element for each possible modulation pattern. The element with the highest value is taken to be
the classification of the input signal.

Two sets of experiments were carried out. In the first, the number of pulses per
illumination is assumed to be no more than 16, and in the second, no more than 32. Where the
illumination contains fewer pulses than the maximum, the corresponding vector is padded with
zeroes.

Before generating training and test vectors, the number of nodes in the various layers must
be decided upon. According to the Baum and Haussler result, in order to obtain an error of less
than $\epsilon$ in the test vectors, the number of training vectors must be at least
$1/\epsilon$ times the number of connections in the network. For these experiments $\epsilon$
was chosen to be 0.2. Therefore the number of training vectors must be at least five times the
number of connections in the network. To compare the relative merits both in terms of
performance and training times, five different networks were built for each set of experiments.
In the first network, the number of nodes in the hidden layer was equal to the number of nodes
in the output layer; in the second network, the number of nodes in the hidden layer was equal to
two times the number of nodes in the output layer, and so on. This was done to test the Kudrycki
hypothesis that the ratio of the number of nodes in the hidden layer to the output layer should
be three to one.

The notation used in this paper to define the architecture of the various networks is i-h-o
where i indicates the number of nodes in the input layer, h the number of nodes in the hidden
layer, and o the number of nodes in the output layer. The networks used in the first set of
experiments are therefore denoted as 16-5-5, 16-10-5, 16-15-5, 16-20-5, and 16-25-5.

The number of weights or connections in a three layer back-propagation network is given
by (i × h) + (h × o). For the above networks this yields 105, 210, 315, 420, and 525 connections
respectively. Using the Baum-Haussler rule, the number of training vectors to be generated for
each network should be at least 525, 1050, 1575, 2100 and 2625 respectively.
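
These figures can be checked with a few lines of code. The function name `connections` is
illustrative; as in the formula above, bias weights are not counted.

```python
def connections(i, h, o):
    """Number of weights in an i-h-o network: (i x h) + (h x o)."""
    return i * h + h * o

for h in (5, 10, 15, 20, 25):
    w = connections(16, h, 5)
    # Five training vectors per connection, i.e. 1 / 0.2.
    print(f"16-{h}-5: {w} connections, at least {5 * w} training vectors")
```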

In generating the training vectors, there are four parameters of concern: the number of
different classifications (or output nodes), the variation in the number of pulses in the
illumination (in this case, some number between 1 and 16), the variation in the number of periods
in the illumination and the variation in the phase. The basic algorithm for generating the
training vectors is as follows:

    FOR EACH classification type c DO
        FOR x = pulse_min TO pulse_max BY pulse_step DO
            FOR y = period_min TO period_max BY period_step DO
                FOR z = phase_min TO phase_max BY phase_step DO
                    Generate a new vector v(c, x, y, z);
                    Generate the corresponding output vector w;
                    Output v and w;
                END
            END
        END
    END
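
A runnable sketch of this generation loop is given below. The report does not give the signal
generation equations, so the waveform definitions, the expression of the phase as a fraction of
one cycle, and all names (`PATTERNS`, `waveform`, `make_vector`, `training_set`, `linspace`) are
assumptions made for illustration. Inputs lie in [-1, 1], short illuminations are padded with
zeroes, and target entries are 0.9 for the true class and 0.1 otherwise.

```python
import math

PATTERNS = ("sine", "triangle", "saw_neg", "saw_pos", "square")

def waveform(pattern, cycles):
    """Value in [-1, 1] of a pattern at a position measured in cycles."""
    frac = cycles % 1.0
    if pattern == "sine":
        return math.sin(2.0 * math.pi * frac)
    if pattern == "triangle":
        return 4.0 * abs(frac - 0.5) - 1.0
    if pattern == "saw_pos":
        return 2.0 * frac - 1.0
    if pattern == "saw_neg":
        return 1.0 - 2.0 * frac
    if pattern == "square":
        return 1.0 if frac < 0.5 else -1.0
    raise ValueError(pattern)

def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive."""
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [lo + k * step for k in range(n)]

def make_vector(pattern, n_pulses, periods, phase, size=16):
    """Input vector: n_pulses equally spaced samples, zero padded to size."""
    samples = [waveform(pattern, periods * k / n_pulses + phase)
               for k in range(n_pulses)]
    return samples + [0.0] * (size - n_pulses)

def make_target(pattern):
    """Output vector with 0.9 at the true class and 0.1 elsewhere."""
    return [0.9 if p == pattern else 0.1 for p in PATTERNS]

def training_set(pulse_counts, period_values, phase_values, size=16):
    pairs = []
    for c in PATTERNS:                      # classification type
        for x in pulse_counts:              # number of pulses
            for y in period_values:         # periods per illumination
                for z in phase_values:      # starting phase (fraction of a cycle)
                    pairs.append((make_vector(c, x, y, z, size), make_target(c)))
    return pairs

pairs = training_set(range(12, 17), linspace(1.0, 3.0, 5), linspace(0.0, 0.996, 5))
print(len(pairs))
```

With five values for each of the three loop variables and five pattern types, the sketch
produces 5 × 5³ = 625 training pairs.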

To avoid saturation, the entries in the input training and test vectors are computed so as
to lie within the range of -1 to +1, and as Freeman and Skapura [4] suggest, the entries in the
output training vector all lie within the range of 0.1 to 0.9.

To give equal weight to each of the variations, the number of possible values taken on
by each of x, y and z in the above algorithm should be the same, say n. The number of training
vectors generated is then equal to $Cn^3$, where C is the number of classifications, i.e. the
number of output nodes. To satisfy the Baum-Haussler result, n must be such that $Cn^3$ is
greater than or equal to five times the number of connections in the network. The value of n can
therefore be computed by taking the ceiling of the cube root of 5/C times the number of
connections in the network. Since C is equal to five in this case, n is simply equal to the
ceiling of the cube root of the number of connections in the network. Table I shows the minimum
number of training vectors to be generated to satisfy Baum-Haussler, and the actual number of
training vectors that were generated for all of the time domain network experiments.
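
The calculation of n can be sketched as follows. The function name `variations_needed` is
illustrative, and an integer search is used in place of a floating-point cube root so that
exact cubes are handled without rounding error.

```python
def variations_needed(n_connections, n_classes=5, factor=5):
    """Smallest n with n_classes * n**3 >= factor * n_connections,
    returned together with the resulting number of training vectors."""
    required = factor * n_connections
    n = 1
    while n_classes * n ** 3 < required:
        n += 1
    return n, n_classes * n ** 3

for w in (105, 210, 315, 420, 525, 185, 370, 555, 740, 925):
    print(w, variations_needed(w))
```

For 105 connections this gives n = 5 and 625 training vectors, and for 925 connections n = 10
and 5000 training vectors.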

The parameters x, y, and z are evenly spread across the desired range of values. For
example, if the period of the modulation pattern is to vary between one and three, and five values
are to be computed, then the values will be 1.0, 1.5, 2.0, 2.5 and 3.0. If seven values are
needed, the values could be 1.002, 1.335, 1.668, 2.001, 2.334, 2.667, and 3.0. Table II shows the
number of values desired for each variation (n), and the minimum, maximum and step size required
to compute values for x, y, and z. The first five rows show the range of values used to generate
the training vectors for the 16 input networks. The sixth row shows the ranges used to generate
the test vectors used in testing all of the 16 input networks. The ranges used to generate the
training and test vectors for the 32 input networks are shown in the subsequent six rows.

The test vectors are generated using the same algorithm, although care is taken that no
test vector is identical to any of the training vectors. This is to ensure that the network is
tested against previously unseen vectors.

The second set of experiments was carried out using networks with 32 inputs, i.e. 32-5-5,
32-10-5, 32-15-5, 32-20-5, and 32-25-5. Apart from the change in dimension of the input vector,
the same algorithm was used to generate the training and test vectors.

The actual experimental process consisted of generating the training and test vectors, and
building the corresponding network using the commercial product NeuralWorks Professional II,
produced by NeuralWare Inc. Each network was initially presented with 20,000 training vectors
randomly chosen from the generated training vector set. (The number 20,000 was determined
experimentally to be a convenient interval for assessing the improvement in performance over
time.) The performance of the network was then determined by computing the percentage of
correctly classified signals represented by the test vectors. Every network used in these
experiments has five outputs, each output corresponding to a particular classification. A signal
is considered to be correctly classified if the value in its associated output node is greater
than any value in the other output nodes. Although this is a weak measure of success, it is
sufficient for comparison purposes. (A stronger measure of correctness would insist that not
only should the associated output node's value be the greatest, but that it exceed some
threshold as well.) In addition to the performance of the network, the length of time required
to train the network with the 20,000 vectors is recorded. The training and testing process is
repeated for a total of 20 times (400,000 presentations of training vectors). Table III shows
the results for the networks with 16 inputs, and Table IV shows the results for the networks
with 32 inputs. Figure 5 to Figure 14 in Appendix A show the results in a graphical format.
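
The two measures of success mentioned above, the weak "largest output wins" rule and the
stronger rule with a threshold, can be sketched as follows. The function names and the example
output values are illustrative and are not results from the experiments.

```python
def classify(outputs, threshold=None):
    """Index of the largest output; None if a threshold is given and the
    largest output does not exceed it."""
    best = max(range(len(outputs)), key=lambda k: outputs[k])
    if threshold is not None and outputs[best] <= threshold:
        return None
    return best

def percent_correct(output_vectors, true_classes, threshold=None):
    """Percentage of test vectors whose classification matches the true class."""
    hits = sum(1 for out, c in zip(output_vectors, true_classes)
               if classify(out, threshold) == c)
    return 100.0 * hits / len(true_classes)

outputs = [[0.2, 0.7, 0.1], [0.4, 0.3, 0.35], [0.1, 0.2, 0.8]]   # made-up values
classes = [1, 0, 2]
print(percent_correct(outputs, classes))                  # weak measure
print(percent_correct(outputs, classes, threshold=0.5))   # stronger measure
```

In this made-up example the second vector is counted as correct under the weak measure but not
under the stronger one, because its largest output (0.4) does not exceed the threshold.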

Network    Number of      Required       Number of     Actual
           Connections    Number of      Variations    Number of
           (Weights)      Training       (n)           Training
                          Vectors                      Vectors

16-5-5     105            525            5             625
16-10-5    210            1050           6             1080
16-15-5    315            1575           7             1715
16-20-5    420            2100           8             2560
16-25-5    525            2625           9             3645
32-5-5     185            925            6             1080
32-10-5    370            1850           8             2560
32-15-5    555            2775           9             3645
32-20-5    740            3700           10            5000
32-25-5    925            4625           10            5000

Table I: Time Domain: Number of Training Vectors

Net        n    Pulse   Pulse   Pulse   Period  Period  Period  Phase   Phase   Phase
                min     max     step    min     max     step    min     max     step

16-5-5     5    12.0    16.0    1.0     1.0     3.0     0.5     0.0     0.996   0.249
16-10-5    6    ?       16.0    1.0     1.0     ?       ?       ?       ?       0.199
16-15-5    7    10.0    16.0    1.0     1.002   ?       ?       0.0     ?       ?
16-20-5    8    9.0     16.0    1.0     ?       ?       ?       ?       ?       ?
16-25-5    9    8.0     16.0    ?       ?       ?       ?       ?       ?       ?
Test 16    ?    9.0     15.0    2.0     ?       ?       ?       ?       ?       ?
32-5-5     6    ?       ?       ?       ?       ?       ?       0.0     ?       ?
32-10-5    8    ?       ?       ?       1.005   ?       ?       ?       ?       ?
32-15-5    9    24.0    32.0    1.0     ?       ?       ?       ?       ?       ?
32-20-5    10   23.0    ?       ?       ?       ?       ?       0.0     ?       ?
32-25-5    10   ?       ?       ?       1.0     3.0     0.222   ?       ?       ?
Test 32    ?    ?       ?       2.0     1.001   2.999   0.666   0.0     ?       ?

(Entries shown as ? are illegible in the source copy.)

Table II: Time Domain: Pulse Count, Period and Phase Settings

5.2  The Frequency Domain

One of the goals of this project is to analyze the performance of neural networks using
data that has been transformed from the time domain to the frequency domain. Transforming
data in this way, most commonly using a Fourier transform, is a common technique for
analyzing signals in digital signal processing. The Fourier transform is typically applied to an
amplitude versus time representation of a signal. This project, however, deals with the
modulation of pulse frequencies over time. A transformation into the frequency domain in this
case yields a pulse frequency versus modulation frequency representation of the signal.

Note that in a fielded application the additional cost of pre-treating data must be weighed
against any possible benefits.

5.2.1  The Fourier Transform

The Fast Fourier Transform is an algorithm that quickly computes the Fourier transform
of a signal. To use this algorithm, it is necessary that the number of points in the signal be a
power of two. However, the number of points (pulses) is variable, and padding the signal (i.e.
the input vector) with zero entries causes a problem: the discontinuity introduced with the
addition of the zero entries can significantly alter the behaviour of the amplitude derived by the
Fourier transform (the amplitude of the signal being the pulse frequency in our application). To
remove this discontinuity, the signal values (pulse frequencies) are multiplied by a Blackman
[11] window prior to the addition of the zero values. The window is a function whose values
begin at zero, gradually rise to a maximum of one at the centre of the window and then
gradually descend towards zero:

\[
w(k) =
\begin{cases}
0.42 - 0.5\cos(2\pi k/M) + 0.08\cos(4\pi k/M), & 0 \le k \le M \\
0.0, & \text{otherwise}
\end{cases}
\]

The  new  signal  therefore  has  a  value  close  to  zero  before  the  zero  data  points  are  added. 
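The windowing and padding steps can be sketched as follows. This is a minimal illustration,
not the code used in the experiments: the function name `window_pad_fft` is hypothetical,
NumPy's `np.blackman` stands in for the window (for a signal of L pulses it evaluates the
Blackman formula with M = L - 1), and taking the magnitude of the transform as the network
input is an assumption.

```python
import numpy as np

def window_pad_fft(pulse_freqs, n_inputs):
    """Blackman-window a pulse-frequency sequence, zero-pad it, return (padded, |FFT|)."""
    x = np.asarray(pulse_freqs, dtype=float)
    if n_inputs < len(x) or (n_inputs & (n_inputs - 1)) != 0:
        raise ValueError("n_inputs must be a power of two no smaller than the signal length")
    windowed = x * np.blackman(len(x))   # tapers both ends of the signal towards zero
    padded = np.zeros(n_inputs)
    padded[:len(x)] = windowed           # the zero entries now follow a near-zero tail
    return padded, np.abs(np.fft.fft(padded))

# 13 pulses at a constant (hypothetical) frequency of 9.0, padded to 16 inputs
padded, magnitude = window_pad_fft([9.0] * 13, 16)
```

Because the windowed signal is already close to zero at its last pulse, appending zeros no
longer introduces a jump in the data.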

It is also important to note that with the Fourier transformation, the two variations of the
sawtooth wave pattern will result in the identical function in the frequency domain (the
transformation of the negative and positive wave patterns differing by a phase shift of 180
degrees). The total number of distinct signal classifications is therefore reduced to four. The
distinction between the two sawtooth patterns can be easily made after the neural network has
produced its output.
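The 180 degree relationship can be checked numerically. The sketch below assumes that the
negative sawtooth is the positive sawtooth mirrored about the mean pulse frequency and that
the mean has been removed before the transform; both are assumptions made for illustration,
and the ramp values are hypothetical.

```python
import numpy as np

n = 16
ramp = np.linspace(-1.0, 1.0, n)       # one period of a zero-mean positive sawtooth
window = np.blackman(n)

positive = np.fft.fft(ramp * window)
negative = np.fft.fft(-ramp * window)  # mirrored (negative) sawtooth

same_magnitude = np.allclose(np.abs(positive), np.abs(negative))
opposite_sign = np.allclose(positive, -negative)  # a factor of -1 is a 180 degree shift
```

Under these assumptions the two magnitude spectra coincide, so a network fed magnitudes
alone cannot tell the two sawtooth variants apart.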


5.2.2  Frequency Domain Experiments

As for the time domain experiments, separate sets of training and test vectors were
generated to serve as input to a new set of neural networks. For these experiments, five networks
with 16 input nodes were created, ranging from four hidden nodes to twenty: 16-4-4, 16-8-4,
16-12-4, 16-16-4, and 16-20-4, and five networks with 32 inputs: 32-4-4, 32-8-4, 32-12-4,
32-16-4, and 32-20-4. The basic algorithm to generate the input vectors is the same. Again, the
parameters that can vary are the classification of the signal, the number of pulses in the signal,
the number of periods in the illumination and the phase of the signal:


FOR EACH classification type c DO
    FOR x = pulse min TO pulse max BY pulse step DO
        FOR y = period min TO period max BY period step DO
            FOR z = phase min TO phase max BY phase step DO
                Generate a new vector v(c, x, y, z);
                Apply Blackman window to v;
                Apply Fourier Transform to v;
                Generate the corresponding output vector w;
                Output v and w;
            END
        END
    END
END
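A runnable sketch of this loop is given below. It is an illustration under stated
assumptions rather than the generator used in the experiments: the class labels, the
`pattern` waveforms, the use of FFT magnitudes as network inputs and the one-hot output
vector are all hypothetical, and the parameter ranges are those listed for the 16-5-5
network in Table II, reused here only to give five values per parameter.

```python
import numpy as np

CLASSES = ("sine", "triangle", "square", "sawtooth")  # hypothetical class labels

def pattern(kind, t):
    """Hypothetical unit-amplitude modulation pattern at position t (in periods)."""
    frac = t % 1.0
    if kind == "sine":
        return np.sin(2 * np.pi * t)
    if kind == "triangle":
        return 1.0 - 4.0 * np.abs(frac - 0.5)
    if kind == "square":
        return np.where(frac < 0.5, 1.0, -1.0)
    return 2.0 * frac - 1.0  # sawtooth

def generate(n_inputs, pulse_counts, periods, phases):
    """Yield (input vector v, output vector w) pairs, one per parameter combination."""
    for c, kind in enumerate(CLASSES):
        for x in pulse_counts:            # number of pulses in the signal
            for y in periods:             # number of periods in the illumination
                for z in phases:          # starting phase, as a fraction of a period
                    t = z + y * np.arange(x) / x
                    v = np.zeros(n_inputs)
                    v[:x] = pattern(kind, t) * np.blackman(x)  # window, then zero-pad
                    v = np.abs(np.fft.fft(v))                  # frequency-domain input
                    w = np.eye(len(CLASSES))[c]                # one-hot target
                    yield v, w

vectors = list(generate(16, range(12, 17),
                        np.linspace(1.0, 3.0, 5), np.linspace(0.0, 0.996, 5)))
# 4 classes x 5 pulse counts x 5 periods x 5 phases = 500 pairs
```

With five values per parameter the loop produces 500 pairs, the count Table V gives for the
16-4-4 network.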


Table V lists the number of training vectors to be generated for each network. Table VI
shows the ranges and step sizes used to specify the x, y and z values of the algorithm. Table VII
shows the results for the networks with 16 inputs, and Table VIII shows the results for the
networks with 32 inputs. Figure 15 to Figure 24 in Appendix A show the results in a graphical
format.
Network    Number of      Required Number    Number of     Actual Number
           Connections    of Training        Variations    of Training
           (Weights)      Vectors            (n)           Vectors

16-4-4          80              400               5              500
16-8-4         160              800               6              864
16-12-4        240             1200               7             1372
16-16-4        320             1600               8             2048
16-20-4        400             2000               8             2048
32-4-4         144              720               6              864
32-8-4         288             1440               8             2048
32-12-4        432             2160               9             2916
32-16-4        576             2880               9             2916
32-20-4        720             3600              10             4000

Table V: Frequency Domain: Number of Training Vectors
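The entries of Table V follow a simple pattern, sketched below. The relations are read off
the tabulated values rather than stated in the text, so they should be treated as
assumptions: the weight count ignores bias terms, the required number of vectors is five
times the weight count, and n is the smallest number of variations per parameter for which
4 classes times n cubed meets that requirement. The function names are hypothetical.

```python
def connections(n_in, n_hidden, n_out):
    """Weights in a fully connected three-layer network, bias terms not counted."""
    return n_in * n_hidden + n_hidden * n_out

def smallest_variations(required, n_classes=4):
    """Smallest n such that n_classes * n**3 training vectors meet the requirement."""
    n = 1
    while n_classes * n ** 3 < required:
        n += 1
    return n

def table_row(n_in, n_hidden, n_out, n_classes=4):
    weights = connections(n_in, n_hidden, n_out)
    required = 5 * weights                     # five vectors per weight (inferred)
    n = smallest_variations(required, n_classes)
    return weights, required, n, n_classes * n ** 3

row = table_row(16, 12, 4)   # weights, required, n, actual for the 16-12-4 network
```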
Net   n   Pulse min   Pulse max   Pulse step   Period min   Period max   Period step   Phase min   Phase max   Phase step

16-4-4 

5 

1.0 

0.0 

0.996 

16-8-4 

6 

0.0 

0.995 

ISBIB 

0.333 

0.0 

0.996 

0.166 

1  16-16-4  1  8 

1.005 

3.0 

0.285 

0.142 

0.285 

0.142 

m 

1.001 

3.0 

0.666 

0.333 

1  32-4-4 

1  6 

t 

_l _ 

1.0 

0.4 

0.199 

n 

1.0 

_ 

1.005 

0.0 

0.994 

0.142 

0.0 

0.992 

0.124 

0.0 

32-20-4 

]  10 

J _ 

0.222 

0.0 

0.999 

0.111  1 

test32 

11 

2.0 

1.001 

2.999 

0.0 

0.999 

0.333 

Table VI: Frequency Domain: Pulse Count, Period and Phase Settings
5.3  Observations

The experiments show that either type of network can be used successfully to classify
radar signals. However, the acceptability of a network's performance depends on the application
and is usually defined as the ability of the neural network to achieve some threshold of success
over a set of test vectors. It is important to note that, in general, neural networks cannot
guarantee a one hundred percent success rate for all possible inputs. (One of the networks in the
experiments, 32-25-5, did achieve a success rate of one hundred percent, but this was on a
limited set of test vectors.)

For example, if a 95 percent success rate is considered acceptable, then the time
domain networks 16-20-5, 16-25-5, 32-15-5, 32-20-5, and 32-25-5 all pass this criterion based
on the set of test vectors presented to them. In the frequency domain, the networks 32-8-4,
32-12-4, 32-16-4, and 32-20-4 meet this criterion. However, if the application calls for a 99
percent success rate, then only the time domain networks 32-15-5, 32-20-5, and 32-25-5 can
satisfy this requirement. None of the frequency domain networks meet this standard.

The experimental results concur with Kudrycki's result that the number of nodes in the
hidden layer must be at least three times the number of nodes in the output layer to obtain
adequate performance. Apart from the frequency domain network 32-8-4, the highest rate of
success for any of the networks having equal or double the number of nodes in the hidden layer
as in the output layer was 91.406 percent. For the networks with only 16 input nodes, the
highest rating was 76.953 percent. Equal or better performance is obtained from networks whose
hidden layer had four or five times the number of nodes in the output layer. This greater degree
of success, however, was achieved at the cost of additional training time. It should also be noted
that these larger networks cannot execute as fast as the networks with fewer connections (unless
parallel processing techniques are employed).

The training time was found to be proportional to the number of connections in the
network and in all cases fairly lengthy. The times shown in Tables III, IV, VII, and VIII indicate
only the training time, and yet this alone adds up to more than 30 hours. This total does not
include the time required to generate the training and test vectors, the time to build the networks,
the time to run the tests, nor the time to record the results. The entire process actually required
more than two person-weeks. However, to achieve the same results using traditional programming
methods would require many times this effort. Of course, if one decides in advance of the
training process that a certain percent success rate is acceptable, then the training time can be
reduced by stopping the training process when this level of success has been met. One still needs
an upper limit, as the success rate may never reach this level. In the experiments performed in
this project, the number of iterations of training (400,000) was chosen arbitrarily on the
assumption that if the training error function had failed to converge after this amount of training,
then it was unlikely to ever do so. The reader will note that even after 400,000 iterations of
training, 8 of the 20 networks had failed to reach the 95 percent threshold.
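The stopping rule described above, a target success rate combined with an upper limit on
iterations, can be sketched as follows. The function `train_until` and its arguments are
hypothetical: `train_step` is assumed to perform one training iteration, `success_rate` to
return the current fraction of correctly classified vectors, and the check interval of
1,000 iterations is an arbitrary choice.

```python
def train_until(train_step, success_rate, target=0.95,
                max_iterations=400_000, check_every=1_000):
    """Train until success_rate() reaches target or the iteration cap is hit.

    Returns (iterations performed, whether the target was reached).
    """
    for iteration in range(1, max_iterations + 1):
        train_step()
        # Evaluating the success rate is costly, so it is only done periodically.
        if iteration % check_every == 0 and success_rate() >= target:
            return iteration, True
    return max_iterations, False
```

A network whose success rate never reaches the target simply runs to the cap, which is the
role the 400,000 iteration limit played in these experiments.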

In comparing the general performance of the two types of networks, it is clear that the
time domain networks were the better performers. Taking a 95 percent success rate as the
acceptable performance criterion, two of the 16 input time domain networks, 16-20-5 and 16-25-5,
were able to achieve this rate, whereas the highest success rate of all of the 16 input frequency
domain networks was only 76.953 percent. To achieve this level of success in the frequency
domain requires a 32 input network. This is significant because it is generally easier to generate
data for networks with smaller input vectors, and networks with fewer connections result in
faster execution times. The actual time required to obtain the 95 percent rate for the time domain
network 16-20-5 was 55 minutes and 30 seconds, whereas one hour, 17 minutes and 14 seconds
was required to achieve this rating for the frequency domain network 32-8-4. Note also that in
the frequency domain case, additional time is required to perform the required transformations
in generating the training and test vectors. The poorer performance of the frequency domain
networks could be a result of the Blackman window transformation. The distortion of the input
data caused by this transformation could increase the similarity between two modulation pattern
types, thereby decreasing the network's ability to distinguish between the two different patterns.

6.0  CONCLUSIONS 

This paper has described the fundamentals of neural networks: their construction,
training, and behaviour. Particular attention has been paid to the class of neural networks called
back-propagation networks due to the simplicity of their training algorithm and their
effectiveness in solving classification type problems. Theoretical issues involving the
architecture, the training algorithm, the execution algorithm, and several methods for accelerating
the training process of this class of neural networks were covered. Practical considerations were
dealt with as well. These include the pre-treatment of input data, the initialization of weights,
the choice of the number of training vectors, the number of layers and the number of nodes in
each layer.

This background material was then followed by a description of a series of experiments
which investigated the applicability of the neural network technique in classifying the modulation
patterns of pulse-to-pulse frequency agile emitters. The experiments showed that it is possible
to construct a neural network to perform this classification task. They also showed that there are
no clear-cut rules to follow in the construction of a network. It is somewhat of a black art:
one must rely on rules of thumb and "tweaking" to produce a successful network.

As well, the experiments showed that several networks with different architectures could
be trained to perform the same task successfully. In designing a particular neural network for
a specific application, one is generally forced to make a trade-off between network performance,
training time, and execution time. To find the optimal network, the designer must experiment
with different architectures and different representations of the input data. This can be a time
consuming task, involving the construction, training and evaluation of multiple networks. These
tasks are relatively simple ones, certainly much simpler than implementing a classification
algorithm in software. The problem of classifying the five signal patterns used in this project,
a problem easily solved by a human, is actually very difficult to solve using traditional software
techniques, particularly when the inputs are allowed to vary as much as they do in this project
(variable number of inputs and periods with arbitrary phase).

Finally, neural networks are a promising technology for radar signal classification
problems. However, more experimentation is required, for example, to test performance when
there is greater variability in the possible signal data. This project also made restrictive
assumptions concerning the nature of the signal data, e.g., fixed PRI, a minimum of eight pulses,
and at least one full period of the modulation pattern per illumination. While it is true that any
technique will have difficulty classifying a signal with fewer than eight pulses and only a
fractional portion of one modulation pattern, an effective classifier must be able to handle signals
exhibiting PRI agility.
Appendix A: Figures of Experimental Results

Figure 6: 16-10-5 Results

Figure 7: 16-15-5 Results

Figure 9: 16-25-5 Results

Time 32-10-5 Training Results

Figure 19: 16-20-4 Results